1 Data Dictionary

For the purposes of this Project the following variables and definitions are important.

The population in this dataset is the incoming cohort of students in Fall of 2019 and 2020. These students are first time degree or certificate seekers and have no prior tertiary education. They may have earned AP credits in highschool.

Fall2019 refers to the incoming freshman cohort in Fall2019. This is term year 2020.
Fall2020 refers to the incoming freshman cohort in Fall2020. This is term year 2021.

Variables of Interest: term year Incoming students in Fall2019 are assigned to term year 2020. Incoming students in Fall 2020 are assigned to term year 2021.
hours_earned: refers to credit hours the student has earned in their first Fall semester ( this can include credits earned in Summer school second session- Summer 1 and AP credits earned in high school).
hours_attempted: refers to credit and non credit hours the student has attempted in their first Fall semester ( this may include credits attempted in Summerschool second session - Summer 1).
full_part: is the student full-time (FT) or part-time (PT). Part time students are registered in less than 12 credit hours. Full-time students take at least 12 credits. major: degree programme student is registered for or certificate&LR ( letter of recommendation.) All certificates and letters of recommendations have been grouped together.
hours_earned_rate: Ratio of hours_earned/hours_attempted age: Age of student at start of program.
race: Racial classification of student. sex: Gender classification of student. high_school: Name of highschool student graduted from. Public High schools in Montgomery county are classified as MCPS. pell: Whether the student receives a pell grant or not.

2 Data Wrangling

2.1 Import Data

#import data for Fall 2016 to Fall 2020
# set directory
setwd('/Users/ulita/Documents/CAPSTONE205/Capstone/Capstone_Data/Raw_Data')

initial_df_2019<-read_xlsx("FTEIC F19.xlsx")
initial_df_2020<-read_xlsx("FTEIC F20.xlsx")

Read in files from dataMontgomery

#Access list of all highschools in Montgomery County. This will be used to filter highschools in student dataset.

r <- GET('https://data.montgomerycountymd.gov/resource/bfcv-pyzh.json')

#glimpse(r)
#content(r, "text")

jsonRespText <- content(r, as="text")
jsonRespParsed <- content(r, as="parsed")
#jsonRespParsed

df_MC <- fromJSON(jsonRespText)
#head(df_MC)
#class(df_MC)
df_MCPS <- as_tibble(df_MC)

2.2 Create Dataframe

 #create dataframe to compare credits in Fall 2019 and Fall 2020 :    df_credits 
df_initial<- bind_rows(initial_df_2019,initial_df_2020)

head(df_initial)
## # A tibble: 6 x 50
##   `U-Number` Sex   IPEDS_RACE_DESC `Age at Start` HIGH_SCHOOL FULL_PART CITY 
##        <dbl> <chr> <chr>           <chr>          <chr>       <chr>     <chr>
## 1   20190001 F     Hispanic        18 - 20        Out of Sta… PT        Gait…
## 2   20190002 M     Hispanic        18 - 20        Walter Joh… FT        Rock…
## 3   20190003 F     Multi-Race      18 - 20        Wheaton Hi… PT        Gait…
## 4   20190004 M     Asian           18 - 20        Thomas Spr… PT        Gait…
## 5   20190005 M     Hispanic        18 - 20        FOREIGN / … FT        Chev…
## 6   20190006 F     Hispanic        18 - 20        Springbroo… FT        Silv…
## # … with 43 more variables: STAT_CODE <chr>, ZIP <dbl>, PELL_GRANT <chr>,
## #   CAMP_CODE <chr>, Major <chr>, HOURS_ATTEMPTED <chr>, HOURS_EARNED <chr>,
## #   MC_GPA <dbl>, PASS_ENGL <chr>, PASS_MATH <chr>, term_year <chr>,
## #   TERM_YEAR1_10 <chr>, TERM_YEAR1_20 <chr>, TERM_YEAR1_24 <chr>,
## #   TERM_YEAR1_25 <chr>, TERM_YEAR1_30 <chr>, TERM_YEAR1_40 <chr>,
## #   TERM_YEAR2_10 <chr>, TERM_YEAR2_20 <chr>, TERM_YEAR2_24 <chr>,
## #   TERM_YEAR2_25 <chr>, TERM_YEAR2_30 <chr>, TERM_YEAR2_40 <chr>,
## #   TERM_YEAR3_10 <chr>, TERM_YEAR3_20 <chr>, TERM_YEAR3_24 <chr>,
## #   TERM_YEAR3_25 <chr>, TERM_YEAR3_30 <chr>, TERM_YEAR3_40 <chr>,
## #   TERM_YEAR4_10 <chr>, TERM_YEAR4_20 <chr>, TERM_YEAR4_24 <chr>,
## #   TERM_YEAR4_25 <chr>, TERM_YEAR4_30 <chr>, TERM_YEAR4_40 <chr>,
## #   TERM_YEAR5_10 <chr>, TERM_YEAR5_20 <chr>, TERM_YEAR5_24 <chr>,
## #   TERM_YEAR5_25 <chr>, TERM_YEAR5_30 <chr>, first_grad_date <lgl>,
## #   first_degree <lgl>, TERM_YEAR5_40 <chr>

2.3 Select Columns of Interest & Format Names

#clean names using janitor package
df_initial<-df_initial %>%
             clean_names()

#drop columns not interested in

df_initial<-df_initial %>%
             select(c(1:20))

# rename columns to make it user friendly
df_initial<-df_initial %>%
            rename("age"= "age_at_start", "race"= "ipeds_race_desc", 'summer1'= "term_year1_10", 'fall'='term_year1_20')

2.4 Overview of Data

# summary statistics of dataframing using function skimr creates a table of summary statistics of all variables in the dataframe
skim(df_initial)
Data summary
Name df_initial
Number of rows 7519
Number of columns 20
_______________________
Column type frequency:
character 17
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
sex 0 1.00 1 1 0 4 0
race 0 1.00 5 22 0 9 0
age 0 1.00 4 7 0 5 0
high_school 0 1.00 7 30 0 172 0
full_part 0 1.00 2 2 0 2 0
city 21 1.00 5 19 0 132 0
stat_code 21 1.00 2 2 0 16 0
pell_grant 0 1.00 1 1 0 2 0
camp_code 154 0.98 1 1 0 6 0
major 33 1.00 1 61 0 35 0
hours_attempted 87 0.99 2 4 0 55 0
hours_earned 87 0.99 2 4 0 56 0
pass_engl 0 1.00 1 1 0 2 0
pass_math 0 1.00 1 1 0 2 0
term_year 0 1.00 4 4 0 2 0
summer1 50 0.99 1 1 0 1 0
fall 50 0.99 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
u_number 0 1.00 20196656.07 5028.73 20190001 20191880.5 20193760.0 20201708.5 20203588 ▇▃▁▂▇
zip 21 1.00 20884.18 1518.62 1460 20853.0 20877.0 20902.0 94025 ▁▇▁▁▁
mc_gpa 87 0.99 2.16 1.48 0 0.5 2.5 3.5 4 ▆▂▃▅▇

There are 18 variables and 7519 rows of data. Each row of data represents 1 student. There are 3931 rows of data for Fall2019 and 3588 rows of data for Fall2020. Need to change the data types for hours_attempted and hours_earned to numeric. u_number should be a charater. Most of the missing values are in camp_code. For the purposes of this exercise, this variable is not important. There are also 87 missing values in hours attempted and hours earned. As these two are the variables of interest and GPA are of interest, I will drop the rows in these variables with missing values.

2.5 Change Datatypes

# Change datatypes
df_initial$hours_attempted<- as.integer(df_initial$hours_attempted)
df_initial$hours_earned<- as.integer(df_initial$hours_earned)
df_initial$mc_gpa<- as.double(df_initial$mc_gpa)
df_initial$u_number<- as.character(df_initial$u_number)
df_initial$zip<- as.character(df_initial$zip)

#drop missing values in hours_attempted, hours_earned and mc_gap

df_initial <- df_initial %>%
              drop_na(hours_attempted,hours_earned)

Summarise data again

skim(df_initial)
Data summary
Name df_initial
Number of rows 7432
Number of columns 20
_______________________
Column type frequency:
character 17
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
u_number 0 1.00 8 8 0 7432 0
sex 0 1.00 1 1 0 4 0
race 0 1.00 5 22 0 9 0
age 0 1.00 4 7 0 5 0
high_school 0 1.00 7 30 0 171 0
full_part 0 1.00 2 2 0 2 0
city 20 1.00 5 19 0 131 0
stat_code 20 1.00 2 2 0 16 0
zip 20 1.00 4 5 0 166 0
pell_grant 0 1.00 1 1 0 2 0
camp_code 152 0.98 1 1 0 6 0
major 33 1.00 1 61 0 35 0
pass_engl 0 1.00 1 1 0 2 0
pass_math 0 1.00 1 1 0 2 0
term_year 0 1.00 4 4 0 2 0
summer1 8 1.00 1 1 0 1 0
fall 8 1.00 1 1 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hours_attempted 0 1 12.29 6.27 0 9.0 12.0 15.0 54 ▆▇▁▁▁
hours_earned 0 1 7.72 7.40 0 3.0 6.0 12.0 54 ▇▃▁▁▁
mc_gpa 0 1 2.16 1.48 0 0.5 2.5 3.5 4 ▆▂▃▅▇

After removing the missing values in hours_earned and hours_attempted. There are 8 missing values in Fall. Closer inspection shows that these 8 students have a GPA of 0 even though there are hours attempted and hours earned. I will remove these students.

# remove students whose Fall term is NA

df_initial <- df_initial%>%
              filter(., !is.na(fall))
              
skim(df_initial)
Data summary
Name df_initial
Number of rows 7424
Number of columns 20
_______________________
Column type frequency:
character 17
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
u_number 0 1.00 8 8 0 7424 0
sex 0 1.00 1 1 0 4 0
race 0 1.00 5 22 0 9 0
age 0 1.00 4 7 0 5 0
high_school 0 1.00 7 30 0 170 0
full_part 0 1.00 2 2 0 2 0
city 20 1.00 5 19 0 131 0
stat_code 20 1.00 2 2 0 16 0
zip 20 1.00 4 5 0 166 0
pell_grant 0 1.00 1 1 0 2 0
camp_code 152 0.98 1 1 0 6 0
major 33 1.00 1 61 0 35 0
pass_engl 0 1.00 1 1 0 2 0
pass_math 0 1.00 1 1 0 2 0
term_year 0 1.00 4 4 0 2 0
summer1 0 1.00 1 1 0 1 0
fall 0 1.00 1 1 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hours_attempted 0 1 12.29 6.27 0 9.0 12.0 15.0 54 ▆▇▁▁▁
hours_earned 0 1 7.72 7.40 0 3.0 6.0 12.0 54 ▇▃▁▁▁
mc_gpa 0 1 2.16 1.48 0 0.5 2.5 3.5 4 ▆▂▃▅▇

2.6 Missing Values and data Anomalies

df_initial%>% group_by(term_year,full_part)%>%
                      count(hours_attempted==0)
## # A tibble: 7 x 4
## # Groups:   term_year, full_part [4]
##   term_year full_part `hours_attempted == 0`     n
##   <chr>     <chr>     <lgl>                  <int>
## 1 2020      FT        FALSE                   2251
## 2 2020      FT        TRUE                       3
## 3 2020      PT        FALSE                   1606
## 4 2020      PT        TRUE                      27
## 5 2021      FT        FALSE                   2147
## 6 2021      PT        FALSE                   1373
## 7 2021      PT        TRUE                      17

Term year 2020: There are 3 full time and 27 part time students who attempted zero hours. Term year 2021: There are 17 part time students who attempted zero hours. These 47 observations will be dropped.

# remove students who attempted 0 hours

df_initial <- df_initial%>%
              filter(., hours_attempted != 0)
              
skim(df_initial)
Data summary
Name df_initial
Number of rows 7377
Number of columns 20
_______________________
Column type frequency:
character 17
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
u_number 0 1.00 8 8 0 7377 0
sex 0 1.00 1 1 0 4 0
race 0 1.00 5 22 0 9 0
age 0 1.00 4 7 0 5 0
high_school 0 1.00 7 30 0 168 0
full_part 0 1.00 2 2 0 2 0
city 19 1.00 5 19 0 130 0
stat_code 19 1.00 2 2 0 16 0
zip 19 1.00 4 5 0 164 0
pell_grant 0 1.00 1 1 0 2 0
camp_code 152 0.98 1 1 0 6 0
major 33 1.00 1 61 0 35 0
pass_engl 0 1.00 1 1 0 2 0
pass_math 0 1.00 1 1 0 2 0
term_year 0 1.00 4 4 0 2 0
summer1 0 1.00 1 1 0 1 0
fall 0 1.00 1 1 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hours_attempted 0 1 12.36 6.21 1 9.00 12.0 15.0 54 ▆▇▁▁▁
hours_earned 0 1 7.77 7.39 0 3.00 6.0 12.0 54 ▇▃▁▁▁
mc_gpa 0 1 2.18 1.48 0 0.67 2.5 3.5 4 ▆▂▃▅▇

Check missing values in variable major

#check majors list for undeclared and missing values.

df_initial%>% count(major)
## # A tibble: 36 x 2
##    major                        n
##    <chr>                    <int>
##  1 0                           16
##  2 American Sign Language      12
##  3 Applied Geography           10
##  4 Architectural Technology    74
##  5 Art                        108
##  6 Broadcast Media             29
##  7 Building Trade              88
##  8 Business                   879
##  9 Certificate & LR           221
## 10 Communication               68
## # … with 26 more rows
dim(df_initial)
## [1] 7377   20

There are 72 students with undeclared majors. 16 students with majors listed as 0 and 33 listed as NA. Due to privacy reasons, students have been grouped so as not to identify individual students. Degree Major groupings with less than 10 students have been listed as NA or 0. At this point in time I will not remove these data points, because for the purposes of this study, Major names are not required. I will change the NA in major to ‘Field2’ and the 0 to Field1. Students enrolled for a certificate or Letter of Recommendation have been grouped togther irrespective of field.

3 Initial Data Analysis

There are 221 Certificate & LR students in the initial dataset. I will drop these non-degree programmes and focus on degree majors only.

Create dataframe with degrees only. I have retained the majors grouped as NA.

#remove certificates, NA refers to degree majors which have not been named to protect student privacy.
df_Degrees<-subset(df_initial, major!="Certificate & LR")
df_Degrees%>%count(major)
## # A tibble: 34 x 2
##    major                        n
##    <chr>                    <int>
##  1 0                           16
##  2 American Sign Language      12
##  3 Applied Geography           10
##  4 Architectural Technology    74
##  5 Art                        108
##  6 Broadcast Media             29
##  7 Building Trade              88
##  8 Business                   879
##  9 Communication               68
## 10 Computer Applications       81
## # … with 24 more rows

save dataframe Degrees

write.csv(df_Degrees,'df_Degrees.csv')

3.1 Demographics of Degree Seeking Students

3.1.1 Full time versus Part-time Degree Students

#count number of full time versus part-time degree students.

df_Degrees%>% group_by(term_year)%>%
              count(full_part)
## # A tibble: 4 x 3
## # Groups:   term_year [2]
##   term_year full_part     n
##   <chr>     <chr>     <int>
## 1 2020      FT         2222
## 2 2020      PT         1527
## 3 2021      FT         2088
## 4 2021      PT         1286
# change in overall student population from 2020 to 2021

df_Degrees%>%
          group_by(term_year,full_part)%>%
          count(full_part)%>%
          group_by(term_year)%>%
          mutate(total_pop =sum(n))%>%
          group_by(full_part)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_change= (n-lag(n))/lag(n)*100) 
## # A tibble: 4 x 5
## # Groups:   full_part [2]
##   term_year full_part     n total_pop pct_change
##   <chr>     <chr>     <int>     <int>      <dbl>
## 1 2020      FT         2222      3749      NA   
## 2 2021      FT         2088      3374      -6.03
## 3 2020      PT         1527      3749      NA   
## 4 2021      PT         1286      3374     -15.8
# decrease in total student population
((2251+1606)-(2147+1373))/(2251+1606)*100
## [1] 8.737361

Degree Seeking students: There was an 8.7% decrease in the incoming cohort ( part time and full time) size from 2020 to 2021. In term year 2020, there were 2222 full time degree seeking students and 1527 part time degree seeking students. In term year 2021, there were 2088 full time degree seeking students and 1286 part time degree seeking students. There is a 6 % drop in full time enrollment from 2020 to 2021. There is a 15.78 % drop in part time enrollment from 2020 to 2021.

Frequency of Students Part time versus Full tim: 2020 vs 2021

# Number of students part time abnd full time  2020 vs 2021
ggplot(data=df_Degrees, aes(x=full_part, fill=full_part)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=2,size=3)+
      facet_wrap(~term_year)+
      ggtitle("Number of Students Full time versus Part time")+
      ylab('Frequency')+
      xlab("")+
      theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())

Proportion of Students Full time versus Part time: 2020 vs 2021

df_Degrees %>% 
    group_by(term_year) %>% 
    count(full_part) %>% 
    mutate(prop = n/sum(n)) %>% 
    ggplot(aes(x = full_part, y = prop)) +
    geom_col(aes(fill = full_part), position = "dodge") +
    geom_text(aes(label = scales::percent(prop), 
                  y = prop, 
                  group = full_part),
              position = position_dodge(width = 0.9),
              vjust = 2,size=3)+
   facet_wrap(~term_year)+
      ggtitle("Proportion of Students Full time versus Part time")+
      ylab('Percentage')+
      xlab("")+
      theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())

Approximately 3/5 of students in each term year are full time and 2/5 are part time.

3.1.2 Race

Count of Race Groups

ggplot(data=df_Degrees, aes(x=race, fill=race)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=0,size=3)+
      facet_wrap(~term_year + full_part)+
      theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())+
      ggtitle("Number of Students per a Race Group")+
      xlab("Race")+
      ylab("Frequency")

Percentage of each race group in Student Population

df_Degrees %>% 
    group_by(term_year,full_part) %>% 
    count(race) %>% 
    mutate(prop = n/sum(n)) %>% 
    ggplot(aes(x = race, y = prop)) +
    geom_col(aes(fill = race), position = "dodge") +
    geom_text(aes(label = scales::percent(prop,0.1), 
                  y = prop, 
                  group = race),
              position = position_dodge(width = 0.9),
              vjust = 0,size=3)+
    facet_wrap(~term_year + full_part)+
      ggtitle("Race Proportion of Students: Full time versus Part time")+
      ylab('Proportion ')+
      xlab("")+
      theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())

Hispanic students form the largest proportion in the full time and part time student populations, followed by African American, White, Asian and Foreign students.

df_Degrees%>% 
  filter(full_part=='FT')%>%
    group_by(race,term_year)%>%
   tally() 
## # A tibble: 18 x 3
## # Groups:   race [9]
##    race                   term_year     n
##    <chr>                  <chr>     <int>
##  1 Am. Indian / AK Native 2020          6
##  2 Am. Indian / AK Native 2021          4
##  3 Asian                  2020        318
##  4 Asian                  2021        274
##  5 Black / African Am.    2020        569
##  6 Black / African Am.    2021        496
##  7 Foreign                2020        212
##  8 Foreign                2021        176
##  9 Hawaiian / Pac. Isl.   2020          6
## 10 Hawaiian / Pac. Isl.   2021          3
## 11 Hispanic               2020        653
## 12 Hispanic               2021        722
## 13 Multi-Race             2020         81
## 14 Multi-Race             2021         77
## 15 Unknown                2020         19
## 16 Unknown                2021          5
## 17 White                  2020        358
## 18 White                  2021        331

Full time student: Change in enrollment from 2020 to 2021 based on Race

# calculate percentage change in full time student enrollment from 2020 to 2021 by  race

df_Degrees%>%
          filter(full_part=="FT")%>%
          group_by(term_year,race)%>%
          count(race)%>%
          group_by(race)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_change= (n-lag(n))/lag(n)*100) 
## # A tibble: 18 x 4
## # Groups:   race [9]
##    term_year race                       n pct_change
##    <chr>     <chr>                  <int>      <dbl>
##  1 2020      Am. Indian / AK Native     6      NA   
##  2 2021      Am. Indian / AK Native     4     -33.3 
##  3 2020      Asian                    318      NA   
##  4 2021      Asian                    274     -13.8 
##  5 2020      Black / African Am.      569      NA   
##  6 2021      Black / African Am.      496     -12.8 
##  7 2020      Foreign                  212      NA   
##  8 2021      Foreign                  176     -17.0 
##  9 2020      Hawaiian / Pac. Isl.       6      NA   
## 10 2021      Hawaiian / Pac. Isl.       3     -50   
## 11 2020      Hispanic                 653      NA   
## 12 2021      Hispanic                 722      10.6 
## 13 2020      Multi-Race                81      NA   
## 14 2021      Multi-Race                77      -4.94
## 15 2020      Unknown                   19      NA   
## 16 2021      Unknown                    5     -73.7 
## 17 2020      White                    358      NA   
## 18 2021      White                    331      -7.54

Full time students: In 2021 there was a 10.57% increase of Hispanic students when compared to 2020.The was a 12.83 % decrease in African American Students, 13.84 % decrease in Asian Students, a 7,54% decrease in white students and a 17.81% decrease in foreign students.

Part time student: Change in enrollment from 2020 to 2021 based on Race

df_Degrees%>% 
  filter(full_part=='PT')%>%
    group_by(race,term_year)%>%
   tally() 
## # A tibble: 18 x 3
## # Groups:   race [9]
##    race                   term_year     n
##    <chr>                  <chr>     <int>
##  1 Am. Indian / AK Native 2020          6
##  2 Am. Indian / AK Native 2021          2
##  3 Asian                  2020        132
##  4 Asian                  2021        101
##  5 Black / African Am.    2020        404
##  6 Black / African Am.    2021        352
##  7 Foreign                2020        167
##  8 Foreign                2021        101
##  9 Hawaiian / Pac. Isl.   2020          2
## 10 Hawaiian / Pac. Isl.   2021          4
## 11 Hispanic               2020        491
## 12 Hispanic               2021        396
## 13 Multi-Race             2020         51
## 14 Multi-Race             2021         51
## 15 Unknown                2020         16
## 16 Unknown                2021          4
## 17 White                  2020        258
## 18 White                  2021        275

Part time student: Change in enrollment from 2020 to 2021 based on Race

# calculate percentage change in part time student enrollment from 2020 to 2021 by  race

df_Degrees%>%
          filter(full_part=="PT")%>%
          group_by(term_year,race)%>%
          count(race)%>%
          group_by(race)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_change= (n-lag(n))/lag(n)*100) 
## # A tibble: 18 x 4
## # Groups:   race [9]
##    term_year race                       n pct_change
##    <chr>     <chr>                  <int>      <dbl>
##  1 2020      Am. Indian / AK Native     6      NA   
##  2 2021      Am. Indian / AK Native     2     -66.7 
##  3 2020      Asian                    132      NA   
##  4 2021      Asian                    101     -23.5 
##  5 2020      Black / African Am.      404      NA   
##  6 2021      Black / African Am.      352     -12.9 
##  7 2020      Foreign                  167      NA   
##  8 2021      Foreign                  101     -39.5 
##  9 2020      Hawaiian / Pac. Isl.       2      NA   
## 10 2021      Hawaiian / Pac. Isl.       4     100   
## 11 2020      Hispanic                 491      NA   
## 12 2021      Hispanic                 396     -19.3 
## 13 2020      Multi-Race                51      NA   
## 14 2021      Multi-Race                51       0   
## 15 2020      Unknown                   16      NA   
## 16 2021      Unknown                    4     -75   
## 17 2020      White                    258      NA   
## 18 2021      White                    275       6.59

Part time degree seeking students: In 2021 there was a 19.35% decrease of Hispanic part time degree seeking students when compared to 2020. The was a 12.87 % decrease in African American Students, 23.48 % decrease in Asian Students, a 6.59% increase in white students and a 39.52% decrease in foreign students.

3.1.3 Age

Age distribution of Students Part time versus Full tim: 2020 vs 2021

# Age distribution of students 2020 vs 2021
ggplot(data=df_Degrees, aes(x=age, fill=age)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=0,size=3)+
      facet_wrap(~term_year+full_part)+
      ggtitle("Age Distribution of Students: Full time versus Part time")+
      ylab('Frequency')+
      xlab("")+
      theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())

Age distribution of Full time Students

df_Degrees%>%
          filter(full_part=="FT")%>%
          group_by(term_year,age)%>%
          count(age)%>%
          group_by(term_year)%>%
          mutate(total_pop =sum(n))%>%
          group_by(age)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_pop= (n/total_pop*100))
## # A tibble: 10 x 5
## # Groups:   age [5]
##    term_year age         n total_pop pct_pop
##    <chr>     <chr>   <int>     <int>   <dbl>
##  1 2020      < 18      178      2222    8.01
##  2 2021      < 18      167      2088    8.00
##  3 2020      18 - 20  1878      2222   84.5 
##  4 2021      18 - 20  1776      2088   85.1 
##  5 2020      21 - 24    77      2222    3.47
##  6 2021      21 - 24    62      2088    2.97
##  7 2020      25 - 29    41      2222    1.85
##  8 2021      25 - 29    37      2088    1.77
##  9 2020      30 +       48      2222    2.16
## 10 2021      30 +       46      2088    2.20

In term year 2020 and term year 2021 ,approximately 85% of the full time and part time students are between 18-20yrs. In term year 2020, 66.86% of the part time students were between 18-20yrs and in term year 2021, 70.68% students.

Age distribution of Part time Students

df_Degrees%>%
          filter(full_part=="PT")%>%
          group_by(term_year,age)%>%
          count(age)%>%
          group_by(term_year)%>%
          mutate(total_pop =sum(n))%>%
          group_by(age)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_pop= (n/total_pop*100))
## # A tibble: 10 x 5
## # Groups:   age [5]
##    term_year age         n total_pop pct_pop
##    <chr>     <chr>   <int>     <int>   <dbl>
##  1 2020      < 18       74      1527    4.85
##  2 2021      < 18       65      1286    5.05
##  3 2020      18 - 20  1021      1527   66.9 
##  4 2021      18 - 20   909      1286   70.7 
##  5 2020      21 - 24   149      1527    9.76
##  6 2021      21 - 24   108      1286    8.40
##  7 2020      25 - 29    99      1527    6.48
##  8 2021      25 - 29    69      1286    5.37
##  9 2020      30 +      184      1527   12.0 
## 10 2021      30 +      135      1286   10.5

The part time student population in term year 2020 and term year 2021 has a higher proportion of 21-24, 25-29 and 30+ and a lower proportion of under 18s than the full time student population.

3.1.4 Gender

Gender of Students

# Gender of students part time and full time  2020 vs 2021
ggplot(data=df_Degrees, aes(x=sex, fill=sex)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=1,size=3)+
      facet_wrap(~term_year+full_part)+
      ggtitle("Gender of Students: Full time versus Part time")+
      ylab('Frequency')+
      xlab("")+
      theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())

Gender 0 and Gender N are unknown values, however they have not been removed from the dataset as the variables of interest are hours_attempted and hours_earned. Some students choose not to report a gender or are transgender.This is approximately 1% of the student population.

Percentage of Full time Students by Gender

df_Degrees%>%
          filter(full_part=="FT")%>%
          group_by(term_year,sex)%>%
          count(sex)%>%
          group_by(term_year)%>%
          mutate(total_pop =sum(n))%>%
          group_by(sex)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_pop= (n/total_pop*100))
## # A tibble: 8 x 5
## # Groups:   sex [4]
##   term_year sex       n total_pop pct_pop
##   <chr>     <chr> <int>     <int>   <dbl>
## 1 2020      0         4      2222  0.180 
## 2 2021      0         2      2088  0.0958
## 3 2020      F      1068      2222 48.1   
## 4 2021      F      1090      2088 52.2   
## 5 2020      M      1128      2222 50.8   
## 6 2021      M       973      2088 46.6   
## 7 2020      N        22      2222  0.990 
## 8 2021      N        23      2088  1.10

******CHECK*** In term year 2020, there was approximately 2.58% more full time male students than full time female students. In term year 2021, there was approximately 4.89% more full time female students than full time male students. It is evident that the full time male population have decreased more than the full time female student population in term year 2021.

Percentage of Part time Students by Gender

df_Degrees%>%
          filter(full_part=="PT")%>%
          group_by(term_year,sex)%>%
          count(sex)%>%
          group_by(term_year)%>%
          mutate(total_pop =sum(n))%>%
          group_by(sex)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_pop= (n/total_pop*100))
## # A tibble: 8 x 5
## # Groups:   sex [4]
##   term_year sex       n total_pop pct_pop
##   <chr>     <chr> <int>     <int>   <dbl>
## 1 2020      0         5      1527  0.327 
## 2 2021      0         1      1286  0.0778
## 3 2020      F       793      1527 51.9   
## 4 2021      F       668      1286 51.9   
## 5 2020      M       701      1527 45.9   
## 6 2021      M       599      1286 46.6   
## 7 2020      N        28      1527  1.83  
## 8 2021      N        18      1286  1.40

CHECK

In term year 2020, there was approximately 5.67 % more part time female students than part time male students. In term year 2021, there was approximately 3.42% more part time female students than part time male students.

Calculate percentage change in full time student enrollment from 2020 to 2021 by gender

# calculate percentage change in full time student enrollment from 2020 to 2021 by  gender

df_Degrees%>%
          filter(full_part=="FT")%>%
          filter(sex=="F"|sex =="M")%>%
          group_by(term_year,sex)%>%
          count(sex)%>%
          group_by(sex)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_change= (n-lag(n))/lag(n)*100) 
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   term_year sex       n pct_change
##   <chr>     <chr> <int>      <dbl>
## 1 2020      F      1068      NA   
## 2 2021      F      1090       2.06
## 3 2020      M      1128      NA   
## 4 2021      M       973     -13.7

Full time male students: 13.74% decline in registration Full time female students: 2.06% increase in registration

Calculate percentage change in part time student enrollment from 2020 to 2021 by gender

# calculate percentage change in part time student enrollment from 2020 to 2021 by  gender

df_Degrees%>%
          filter(full_part=="PT")%>%
          filter(sex=="F"|sex =="M")%>%
          group_by(term_year,sex)%>%
          count(sex)%>%
          group_by(sex)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_change= (n-lag(n))/lag(n)*100) 
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   term_year sex       n pct_change
##   <chr>     <chr> <int>      <dbl>
## 1 2020      F       793       NA  
## 2 2021      F       668      -15.8
## 3 2020      M       701       NA  
## 4 2021      M       599      -14.6

Part time male students: 14.55% decrease in registration Part time female students: 15.76% decrease in registration

Gender and Race breakdown of full time students

# Gender and Race of full time students  2020 vs 2021

df_Degrees%>%
      filter(sex %in% c("F","M"))%>%
      filter(full_part=="FT")%>%
      ggplot(., aes(x=race, fill=race)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=0, size=3)+
      facet_wrap(~term_year+sex)+
      ggtitle("Gender and Race of Full time Students")+
      ylab('Frequency')+
      xlab("")+
      theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())

#    theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())

Gender and Race breakdown of part time students

# Gender and Race of part time students 2020 vs 2021

df_Degrees %>%
      filter(sex %in% c("F","M"))%>%
      filter(full_part=="PT")%>%
      ggplot(., aes(x=race, fill=race)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=0,size=3)+
      facet_wrap(~term_year+sex)+
      ggtitle("Gender and Race of Part time Students")+
      ylab('Frequency')+
      xlab("")+
      theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())

Full time Student Enrollment Percentages trend by Gender and race

# calculate percentage change in student enrollment from 2020 to 2021 by race and gender

# create data frames with counts of full time students by race and gender
df_Degrees%>%
          filter(full_part=="FT")%>%
          filter(sex=="F"|sex =="M")%>%
          group_by(term_year,race,sex)%>%
          count(sex)%>%
          group_by(race,sex)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_change= (n-lag(n))/lag(n)*100) 
## # A tibble: 36 x 5
## # Groups:   race, sex [18]
##    term_year race                   sex       n pct_change
##    <chr>     <chr>                  <chr> <int>      <dbl>
##  1 2020      Am. Indian / AK Native F         5      NA   
##  2 2021      Am. Indian / AK Native F         2     -60   
##  3 2020      Am. Indian / AK Native M         1      NA   
##  4 2021      Am. Indian / AK Native M         2     100   
##  5 2020      Asian                  F       136      NA   
##  6 2021      Asian                  F       136       0   
##  7 2020      Asian                  M       180      NA   
##  8 2021      Asian                  M       135     -25   
##  9 2020      Black / African Am.    F       265      NA   
## 10 2021      Black / African Am.    F       259      -2.26
## # … with 26 more rows

Full time male students: There was a 25% decrease in asian male students, -21.23 % decrease in african american males, a 23.91 percent decrease in foreign males, a 10.84% decrease in white males and a 2.68% increase in Hispanic males. Full time female students: -2.26 % decrease in african american females, a 13.33 percent decrease in foreign females, a 4.61% decrease in white females and a 17.63% increase in Hispanic females.

Part time Student Enrollment Percentages trend by Gender and race

# calculate percentage change in student enrollment from 2020 to 2021 by race and gender

# create data frames with counts of full time students by race and gender
df_Degrees%>%
          filter(full_part=="PT")%>%
          filter(sex=="F"|sex =="M")%>%
          group_by(term_year,race,sex)%>%
          count(sex)%>%
          group_by(race,sex)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_change= (n-lag(n))/lag(n)*100) 
## # A tibble: 34 x 5
## # Groups:   race, sex [18]
##    term_year race                   sex       n pct_change
##    <chr>     <chr>                  <chr> <int>      <dbl>
##  1 2020      Am. Indian / AK Native F         2      NA   
##  2 2020      Am. Indian / AK Native M         4      NA   
##  3 2021      Am. Indian / AK Native M         2     -50   
##  4 2020      Asian                  F        69      NA   
##  5 2021      Asian                  F        41     -40.6 
##  6 2020      Asian                  M        59      NA   
##  7 2021      Asian                  M        60       1.69
##  8 2020      Black / African Am.    F       205      NA   
##  9 2021      Black / African Am.    F       188      -8.29
## 10 2020      Black / African Am.    M       194      NA   
## # … with 24 more rows

3.1.5 Pell grant

# Pell Grant
ggplot(data=df_Degrees, aes(x=pell_grant, fill=pell_grant)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=3, size=3)+
      facet_wrap(~term_year+full_part)+
      ggtitle("Pell grant")+
      ylab('Frequency')+
      xlab("")+
      theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())

If the data is correct, no students received pell grants in term year 2021???!!!!

3.1.6 Highschool

df_Degrees%>%group_by(term_year,full_part) %>% 
        count(high_school)
## # A tibble: 391 x 4
## # Groups:   term_year, full_part [4]
##    term_year full_part high_school                        n
##    <chr>     <chr>     <chr>                          <int>
##  1 2020      FT        ** HOME SCHOOLED **               15
##  2 2020      FT        Academy of Holy Cross              2
##  3 2020      FT        Al-Huda School                     2
##  4 2020      FT        Albert Einstein HS & MC Art Cn    75
##  5 2020      FT        Avalon School, The - Bethesda      3
##  6 2020      FT        Barrie School - HS Div             1
##  7 2020      FT        Bethesda Chevy Chase High Schl    41
##  8 2020      FT        Bishop McNamara High School        1
##  9 2020      FT        Bladensburg High School            4
## 10 2020      FT        Bowie High School                  2
## # … with 381 more rows

Filter highschools MCPS

# high school names in student data is listed as xxx High School, whilst in MCPS it is abbreviated to HS.
# Modify MCPS dataframe to change HS to High School
# Modify MSCPS dataframe to match highschool names in MC College Student Data.

df_MCPS<- 
  df_MCPS %>%
  mutate(across("school_name", str_replace, "HS", "High School"))
#Correct discrepanies in names for MCPS highschools. 
# Reason: discrepancies in names in MC College data versus Montgomery Data.
# MC College corrections : Bethesda Chevy Chase High Schl, Thomas Sprigg Wootton High Sch, Poolesville Jr-Sr High School,Colonel Zadok Magruder HS,
#Northwest HS - Germantown, Quince Orchard Sr High School,Albert Einstein HS & MC Art Cn, John F. Kennedy High School,Springbrook Sr High School,

df_MCPS<- 
  df_MCPS %>%
  mutate(across("school_name", str_replace, "Bethesda-Chevy Chase High School", "Bethesda Chevy Chase High Schl"))
df_MCPS<- 
  df_MCPS %>%
  mutate(across("school_name", str_replace, "Thomas S Wootton High School", "Thomas Sprigg Wootton High Sch"))
df_MCPS<- 
  df_MCPS %>%
  mutate(across("school_name", str_replace, "Poolesville High School", "Poolesville Jr-Sr High School"))
df_MCPS<- 
  df_MCPS %>%
  mutate(across("school_name", str_replace, "Col Zadok Magruder High School", "Colonel Zadok Magruder HS"))
df_MCPS<- 
  df_MCPS %>%
  mutate(across("school_name", str_replace, "Northwest High School", "Northwest HS - Germantown"))
df_MCPS<- 
  df_MCPS %>%
  mutate(across("school_name", str_replace, "Quince Orchard High School", "Quince Orchard Sr High School"))
df_MCPS<- 
  df_MCPS %>%
  mutate(across("school_name", str_replace, "Albert Einstein High School", "Albert Einstein HS & MC Art Cn"))
df_MCPS<- 
  df_MCPS %>%
  mutate(across("school_name", str_replace, "John F Kennedy High School", "John F. Kennedy High School"))
df_MCPS<- 
  df_MCPS %>%
  mutate(across("school_name", str_replace, "Springbrook High School", "Springbrook Sr High School"))



 
df_MCPS
## # A tibble: 25 x 15
##    school_name zip_code location$latitu… $human_address $needs_recoding
##    <chr>       <chr>    <chr>            <chr>          <lgl>          
##  1 Walter Joh… 20814    39.0253918855    "{\"address\"… FALSE          
##  2 Bethesda C… 20814    38.9868264765    "{\"address\"… FALSE          
##  3 Walt Whitm… 20817    38.9816307564    "{\"address\"… FALSE          
##  4 Poolesvill… 20837    39.1431033716    "{\"address\"… FALSE          
##  5 Thomas Spr… 20850    39.0765821359    "{\"address\"… FALSE          
##  6 Rockville … 20851    39.0863478113    "{\"address\"… FALSE          
##  7 Richard Mo… 20852    39.0772920776    "{\"address\"… FALSE          
##  8 Winston Ch… 20854    39.0443047898    "{\"address\"… FALSE          
##  9 Colonel Za… 20855    39.1313109818    "{\"address\"… FALSE          
## 10 Sherwood H… 20860    39.148342372     "{\"address\"… FALSE          
## # … with 15 more rows, and 13 more variables: $longitude <chr>, city <chr>,
## #   `:@computed_region_vu5j_pcmz` <chr>, address <chr>,
## #   `:@computed_region_d7bw_bq6x` <chr>, `:@computed_region_kbsp_ykn9` <chr>,
## #   latitude <chr>, url <chr>, `:@computed_region_tx5f_5em3` <chr>,
## #   category <chr>, longitude <chr>, `:@computed_region_rbt8_3x7n` <chr>,
## #   phone <chr>
df_Degrees<- df_Degrees %>%
              mutate( MCPS = df_Degrees$high_school%in% df_MCPS$school_name)

df_Degrees%>%
  group_by(term_year,full_part)%>%
  filter(MCPS==TRUE)%>%
  count(high_school)
## # A tibble: 100 x 4
## # Groups:   term_year, full_part [4]
##    term_year full_part high_school                        n
##    <chr>     <chr>     <chr>                          <int>
##  1 2020      FT        Albert Einstein HS & MC Art Cn    75
##  2 2020      FT        Bethesda Chevy Chase High Schl    41
##  3 2020      FT        Clarksburg High School            76
##  4 2020      FT        Colonel Zadok Magruder HS         76
##  5 2020      FT        Damascus High School              41
##  6 2020      FT        Gaithersburg High School         133
##  7 2020      FT        James Hubert Blake High School    74
##  8 2020      FT        John F. Kennedy High School       72
##  9 2020      FT        Montgomery Blair High School     106
## 10 2020      FT        Northwest HS - Germantown         92
## # … with 90 more rows

create highschool catergories: MCPS , Foreign, Homeschool, Out of State, Other MD ( private schools and MD public schools not in MCPS), DC( District of Columbia)

df_Degrees<- df_Degrees%>%
              mutate(HS_classify= case_when(
                .$high_school == "** HOME SCHOOLED **" ~ "home_school", 
                .$high_school == "FOREIGN / INTERNATIONAL HS"~"Foreign",
                .$high_school == "G. E. D."~"GED",
                .$MCPS == TRUE~"MCPS",
                .$high_school=="DISTRICT OF COLUMBIA HS"~"DC",
                .$high_school == "Out of State High School"~"Out of State",
                TRUE~"Other In State"))

count number of MCPS graduates in cohort in term years 2020 and 2021

df_Degrees%>%group_by(term_year,full_part) %>% 
        count(MCPS)
## # A tibble: 8 x 4
## # Groups:   term_year, full_part [4]
##   term_year full_part MCPS      n
##   <chr>     <chr>     <lgl> <int>
## 1 2020      FT        FALSE   546
## 2 2020      FT        TRUE   1676
## 3 2020      PT        FALSE   642
## 4 2020      PT        TRUE    885
## 5 2021      FT        FALSE   512
## 6 2021      FT        TRUE   1576
## 7 2021      PT        FALSE   460
## 8 2021      PT        TRUE    826
df_Degrees%>%group_by(term_year,full_part) %>% 
        count(HS_classify)
## # A tibble: 28 x 4
## # Groups:   term_year, full_part [4]
##    term_year full_part HS_classify        n
##    <chr>     <chr>     <chr>          <int>
##  1 2020      FT        DC                69
##  2 2020      FT        Foreign          193
##  3 2020      FT        GED               33
##  4 2020      FT        home_school       15
##  5 2020      FT        MCPS            1676
##  6 2020      FT        Other In State   174
##  7 2020      FT        Out of State      62
##  8 2020      PT        DC                55
##  9 2020      PT        Foreign          265
## 10 2020      PT        GED               54
## # … with 18 more rows

Count of High Schools

ggplot(data=df_Degrees, aes(x=HS_classify, fill=HS_classify)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=0,size=3)+
      facet_wrap(~term_year + full_part)+
      theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())+
      ggtitle("High Schools Students' Graduated  ")+
      xlab("High School")+
      ylab("Frequency")

The majority of full time and part time students are from MCPS highschools

Proportion of Students from Highschools

df_Degrees %>% 
    group_by(term_year,full_part) %>% 
    count(HS_classify) %>% 
    mutate(prop = n/sum(n)) %>% 
    ggplot(aes(x = HS_classify, y = prop)) +
    geom_col(aes(fill = HS_classify), position = "dodge") +
    geom_text(aes(label = scales::percent(prop,0.5), 
                  y = prop, 
                  group = HS_classify),
              position = position_dodge(width = 0.9),
              vjust = 0.25,size=3)+
    facet_wrap(~term_year + full_part)+
      ggtitle("High schools students graduated")+
      ylab('Proportion ')+
      xlab("")+
      theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())

75.5% of full time degree students in term year 2020 and term year 2021 graduated from MCPS highschools. 58% of part time degree students in term year 2020 and 64% of part time students in term year 2021 graduated from MCPS highschool.

Breakdown of Highschools Full time students in term year 2020 attended in MCPS

df_Degrees%>%
          filter(full_part=="FT" & term_year=="2020")%>%
          filter(HS_classify=="MCPS")%>%
          group_by(term_year,high_school)%>%
          count(high_school)%>%
          group_by(term_year)%>%
          mutate(total_pop =sum(n))%>%
          group_by(high_school)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_pop= (n/total_pop*100))%>%
          arrange(desc(pct_pop))
## # A tibble: 25 x 5
## # Groups:   high_school [25]
##    term_year high_school                        n total_pop pct_pop
##    <chr>     <chr>                          <int>     <int>   <dbl>
##  1 2020      Gaithersburg High School         133      1676    7.94
##  2 2020      Montgomery Blair High School     106      1676    6.32
##  3 2020      Springbrook Sr High School        93      1676    5.55
##  4 2020      Northwest HS - Germantown         92      1676    5.49
##  5 2020      Paint Branch High School          92      1676    5.49
##  6 2020      Wheaton High School               83      1676    4.95
##  7 2020      Richard Montgomery High School    77      1676    4.59
##  8 2020      Clarksburg High School            76      1676    4.53
##  9 2020      Colonel Zadok Magruder HS         76      1676    4.53
## 10 2020      Albert Einstein HS & MC Art Cn    75      1676    4.47
## # … with 15 more rows

The Graph below gives the overall proportions of full time students from highschools in MCPS in term_year 2020. As there are 25 highschools, I have not included the key. The table above give the percentages of Full time students who attended each high school. In term year 2020: 7.93% of full time students finished high school at Gaithersburg High. 6.32% from Montgomery Blair HS. 5.54% from Springbok Sr High School. 5.49% from Northwest HS - Germantown. 5.49% from Paint Branch High School. 9 highschools had a arrange between 4 and 5% On the other end of the spectrum: Walt Whitman High School at 1.13%. Poolesville Jr-Sr High School 1.31% Winston Churchill 2.03% Thomas Sprigg Wootton High Sch 2.03%

v1<- df_Degrees %>% 
    group_by(term_year,full_part) %>% 
    filter(full_part=="FT" & term_year=="2020")%>%
    filter(HS_classify=="MCPS")%>%
    count(high_school) %>% 
    mutate(prop = n/sum(n)) %>% 
    ggplot(aes(x = high_school, y = prop)) +
    geom_col(aes(fill=high_school), position = "dodge") +
    geom_text(aes(label = scales::percent(prop,0.5), 
                  y = prop, 
                  group = high_school),
              position = position_dodge(width = 0.9),
              vjust = 0, size=3, hjust=0)+
  #  facet_wrap(~term_year )+
      ggtitle("High schools full time students graduated in term year 2020 graduated")+
      ylab('Proportion ')+
      xlab("")+
      theme(legend.position = "none", axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank()) 
  
v1+ coord_flip()  

Breakdown of Highschools Full time students in term year 2021 attended in MCPS

df_Degrees%>%
          filter(full_part=="FT" & term_year=="2021")%>%
          filter(HS_classify=="MCPS")%>%
          group_by(term_year,high_school)%>%
          count(high_school)%>%
          group_by(term_year)%>%
          mutate(total_pop =sum(n))%>%
          group_by(high_school)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_pop= (n/total_pop*100))%>%
          arrange(desc(pct_pop))
## # A tibble: 25 x 5
## # Groups:   high_school [25]
##    term_year high_school                        n total_pop pct_pop
##    <chr>     <chr>                          <int>     <int>   <dbl>
##  1 2021      Montgomery Blair High School      99      1576    6.28
##  2 2021      Paint Branch High School          92      1576    5.84
##  3 2021      Gaithersburg High School          91      1576    5.77
##  4 2021      Wheaton High School               90      1576    5.71
##  5 2021      Northwest HS - Germantown         86      1576    5.46
##  6 2021      Colonel Zadok Magruder HS         84      1576    5.33
##  7 2021      Richard Montgomery High School    78      1576    4.95
##  8 2021      Clarksburg High School            75      1576    4.76
##  9 2021      Watkins Mill High School          75      1576    4.76
## 10 2021      John F. Kennedy High School       71      1576    4.51
## # … with 15 more rows
v2<-df_Degrees %>% 
    group_by(term_year,full_part) %>% 
    filter(full_part=="FT" & term_year=="2021")%>%
    filter(HS_classify=="MCPS")%>%
    count(high_school) %>% 
    mutate(prop = n/sum(n)) %>% 
    ggplot(aes(x = high_school, y = prop)) +
    geom_col(aes(fill=high_school), position = "dodge") +
    geom_text(aes(label = scales::percent(prop,0.5), 
                  y = prop, 
                  group = high_school),
              position = position_dodge(width = 0.9),
              vjust = 0, size =3,hjust=0)+
  #  facet_wrap(~term_year )+
      ggtitle("High Schools Full time students in term year 2021 graduated")+
      ylab('Proportion ')+
      xlab("")+
      theme(legend.position = "none", axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank()) 
  
v2+ coord_flip()   

# calculate percentage change in full time student enrollment from 2020 to 2021 by MCPS highschool
df_Degrees%>%
          filter(full_part=="FT")%>%
          filter(HS_classify=="MCPS")%>%
          group_by(term_year,high_school)%>%
          count(high_school)%>%
          group_by(term_year)%>%
          group_by(high_school)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 50 x 4
## # Groups:   high_school [25]
##    term_year high_school                        n pct_change
##    <chr>     <chr>                          <int>      <dbl>
##  1 2020      Albert Einstein HS & MC Art Cn    75      NA   
##  2 2021      Albert Einstein HS & MC Art Cn    66     -12   
##  3 2020      Bethesda Chevy Chase High Schl    41      NA   
##  4 2021      Bethesda Chevy Chase High Schl    44       7.32
##  5 2020      Clarksburg High School            76      NA   
##  6 2021      Clarksburg High School            75      -1.32
##  7 2020      Colonel Zadok Magruder HS         76      NA   
##  8 2021      Colonel Zadok Magruder HS         84      10.5 
##  9 2020      Damascus High School              41      NA   
## 10 2021      Damascus High School              41       0   
## # … with 40 more rows

Full time students: Gaitherburg High had a 31.58% decrease in enrollment. Poolesville Jr-Sr High School 36.36% decrease. Springbrook Sr High School a 31.18% decrease. Walter Johnson High School 28.38% decrease. Rockville High increased enrollment by 40.43%

A more detailed visualisation will be presented on Tableau in EDA.

Highs Schools Part time Students Graduated from. term year 2020 vs term year 2021

Part time students term year 2020:

q1<- df_Degrees %>% 
    group_by(term_year,full_part) %>% 
    filter(full_part=="PT" & term_year=="2020")%>%
    filter(HS_classify=="MCPS")%>%
    count(high_school) %>% 
    mutate(prop = n/sum(n)) %>% 
    ggplot(aes(x = high_school, y = prop)) +
    geom_col(aes(fill=high_school), position = "dodge") +
    geom_text(aes(label = scales::percent(prop,0.5), 
                  y = prop, 
                  group = high_school),
              position = position_dodge(width = 0.9),
              vjust = 0, size=3, hjust=0)+
  #  facet_wrap(~term_year )+
      ggtitle("High schools Part time students graduated in term year 2020 graduated")+
      ylab('Proportion ')+
      xlab("")+
      theme(legend.position = "none", axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank()) 
  
q1+ coord_flip()  

Part time students term year 2021:

q2<- df_Degrees %>% 
    group_by(term_year,full_part) %>% 
    filter(full_part=="PT" & term_year=="2021")%>%
    filter(HS_classify=="MCPS")%>%
    count(high_school) %>% 
    mutate(prop = n/sum(n)) %>% 
    ggplot(aes(x = high_school, y = prop)) +
    geom_col(aes(fill=high_school), position = "dodge") +
    geom_text(aes(label = scales::percent(prop,0.5), 
                  y = prop, 
                  group = high_school),
              position = position_dodge(width = 0.9),
              vjust = 0, size=3, hjust=0)+
  #  facet_wrap(~term_year )+
      ggtitle("High schools Part time students graduated in term year 2020 graduated")+
      ylab('Proportion ')+
      xlab("")+
      theme(legend.position = "none", axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank()) 
  
q2+ coord_flip()  

Breakdown of Highschools Part time students in term year 2020 attended in MCPS

df_Degrees%>%
          filter(full_part=="PT" & term_year=="2020")%>%
          filter(HS_classify=="MCPS")%>%
          group_by(term_year,high_school)%>%
          count(high_school)%>%
          group_by(term_year)%>%
          mutate(total_pop =sum(n))%>%
          group_by(high_school)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_pop= (n/total_pop*100))%>%
          arrange(desc(pct_pop))
## # A tibble: 25 x 5
## # Groups:   high_school [25]
##    term_year high_school                        n total_pop pct_pop
##    <chr>     <chr>                          <int>     <int>   <dbl>
##  1 2020      Northwest HS - Germantown         67       885    7.57
##  2 2020      Gaithersburg High School          63       885    7.12
##  3 2020      Montgomery Blair High School      62       885    7.01
##  4 2020      John F. Kennedy High School       58       885    6.55
##  5 2020      Albert Einstein HS & MC Art Cn    47       885    5.31
##  6 2020      Watkins Mill High School          47       885    5.31
##  7 2020      Clarksburg High School            45       885    5.08
##  8 2020      Paint Branch High School          44       885    4.97
##  9 2020      Richard Montgomery High School    42       885    4.75
## 10 2020      Quince Orchard Sr High School     39       885    4.41
## # … with 15 more rows

Breakdown of Highschools Part time students in term year 2021 attended in MCPS

df_Degrees%>%
          filter(full_part=="PT" & term_year=="2021")%>%
          filter(HS_classify=="MCPS")%>%
          group_by(term_year,high_school)%>%
          count(high_school)%>%
          group_by(term_year)%>%
          mutate(total_pop =sum(n))%>%
          group_by(high_school)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_pop= (n/total_pop*100))%>%
          arrange(desc(pct_pop))
## # A tibble: 25 x 5
## # Groups:   high_school [25]
##    term_year high_school                        n total_pop pct_pop
##    <chr>     <chr>                          <int>     <int>   <dbl>
##  1 2021      Gaithersburg High School          52       826    6.30
##  2 2021      Northwest HS - Germantown         50       826    6.05
##  3 2021      Northwood High School             47       826    5.69
##  4 2021      Montgomery Blair High School      45       826    5.45
##  5 2021      Albert Einstein HS & MC Art Cn    39       826    4.72
##  6 2021      Clarksburg High School            39       826    4.72
##  7 2021      Colonel Zadok Magruder HS         39       826    4.72
##  8 2021      Quince Orchard Sr High School     38       826    4.60
##  9 2021      Paint Branch High School          37       826    4.48
## 10 2021      Springbrook Sr High School        37       826    4.48
## # … with 15 more rows
# calculate percentage change in part time student enrollment from 2020 to 2021 by MCPS highschool
df_Degrees%>%
          filter(full_part=="PT")%>%
          filter(HS_classify=="MCPS")%>%
          group_by(term_year,high_school)%>%
          count(high_school)%>%
          group_by(term_year)%>%
          group_by(high_school)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 50 x 4
## # Groups:   high_school [25]
##    term_year high_school                        n pct_change
##    <chr>     <chr>                          <int>      <dbl>
##  1 2020      Albert Einstein HS & MC Art Cn    47      NA   
##  2 2021      Albert Einstein HS & MC Art Cn    39     -17.0 
##  3 2020      Bethesda Chevy Chase High Schl    22      NA   
##  4 2021      Bethesda Chevy Chase High Schl    15     -31.8 
##  5 2020      Clarksburg High School            45      NA   
##  6 2021      Clarksburg High School            39     -13.3 
##  7 2020      Colonel Zadok Magruder HS         36      NA   
##  8 2021      Colonel Zadok Magruder HS         39       8.33
##  9 2020      Damascus High School              20      NA   
## 10 2021      Damascus High School              22      10   
## # … with 40 more rows

3.1.7 Majors

Overall Majors trend

Count of Majors in Full time students in 2020

z1<- df_Degrees%>%
      filter(full_part=="FT" &term_year =="2020")%>%
       ggplot(., aes(x=major, fill=major)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=0, hjust=0, size =3)+
      ggtitle("Majors of Full-time Students in 2020  ")+
      xlab("Major")+
      ylab("Frequency")+
    theme(legend.position = "none") 
       
z1 + coord_flip()

Count of Majors in Full time students in 2021

z13<- df_Degrees%>%
      filter(full_part=="FT" &term_year =="2021")%>%
       ggplot(., aes(x=major, fill=major)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=0, hjust=0, size =3)+
      ggtitle("Majors of Full-time Students in 2021  ")+
      xlab("Major")+
      ylab("Frequency")+
    theme(legend.position = "none") 
       
z13 + coord_flip()

calculate percentage change in full time student majors from 2020 to 2021

df_Degrees%>%
          filter(full_part=="FT")%>%
          group_by(term_year,major)%>%
          count(major)%>%
          group_by(term_year)%>%
          group_by(major)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 64 x 4
## # Groups:   major [34]
##    term_year major                        n pct_change
##    <chr>     <chr>                    <int>      <dbl>
##  1 2020      0                            3       NA  
##  2 2021      0                            3        0  
##  3 2020      American Sign Language       5       NA  
##  4 2021      American Sign Language       1      -80  
##  5 2020      Applied Geography            2       NA  
##  6 2021      Applied Geography            2        0  
##  7 2020      Architectural Technology    18       NA  
##  8 2021      Architectural Technology    27       50  
##  9 2020      Art                         32       NA  
## 10 2021      Art                         27      -15.6
## # … with 54 more rows

Count of Majors in Part time students in 2020

z11<- df_Degrees%>%
      filter(full_part=="PT" &term_year =="2020")%>%
       ggplot(., aes(x=major, fill=major)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=0, hjust=0, size =3)+
      ggtitle("Majors of Full-time Students in 2020  ")+
      xlab("Major")+
      ylab("Frequency")+
    theme(legend.position = "none") 
       
z11 + coord_flip()

Count of Majors in Part time students in 2021

z12<- df_Degrees%>%
      filter(full_part=="PT" &term_year =="2021")%>%
       ggplot(., aes(x=major, fill=major)) +
      geom_bar() +
      geom_text(stat='count', aes(label=..count..), vjust=0, hjust=0, size =3)+
      ggtitle("Majors of Full-time Students in 2021  ")+
      xlab("Major")+
      ylab("Frequency")+
    theme(legend.position = "none") 
       
z12 + coord_flip()

calculate percentage change in part time student majors from 2020 to 2021

df_Degrees%>%
          filter(full_part=="PT")%>%
          group_by(term_year,major)%>%
          count(major)%>%
          group_by(term_year)%>%
          group_by(major)%>%
          arrange(term_year,.by_group=TRUE)%>%
          mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 63 x 4
## # Groups:   major [33]
##    term_year major                        n pct_change
##    <chr>     <chr>                    <int>      <dbl>
##  1 2020      0                            9       NA  
##  2 2021      0                            1      -88.9
##  3 2020      American Sign Language       3       NA  
##  4 2021      American Sign Language       3        0  
##  5 2020      Applied Geography            4       NA  
##  6 2021      Applied Geography            2      -50  
##  7 2020      Architectural Technology    21       NA  
##  8 2021      Architectural Technology     8      -61.9
##  9 2020      Art                         23       NA  
## 10 2021      Art                         26       13.0
## # … with 53 more rows

Majors by Gender

Majors by Race

3.2 Registration in Summer School 1

Hours attempted in Summer 1 are reported with hours_attempted in Fall. Credits earned in Summer 1 are usually reported with hours earned in Fall.

#count number of full time versus part-time degree students.

df_Degrees%>% group_by(term_year,full_part)%>%
                     count(summer1=='Y')
## # A tibble: 4 x 4
## # Groups:   term_year, full_part [4]
##   term_year full_part `summer1 == "Y"`     n
##   <chr>     <chr>     <lgl>            <int>
## 1 2020      FT        FALSE             2222
## 2 2020      PT        FALSE             1527
## 3 2021      FT        FALSE             2088
## 4 2021      PT        FALSE             1286

There are 0 students who have attended Summer1. Hence values reported for hours_earned are only for credits earned in fall and AP credits. Hours attempted in Fall are only for non credit developmental courses and credit courses taken in Fall.

3.3 Hours Attempted by Degree Seeking Students

# AP credits can be given for courses taken at highschool.
#Check number of students with hours_earned greater than hours_attempted

df_Degrees %>% count(hours_attempted<hours_earned)
## # A tibble: 2 x 2
##   `hours_attempted < hours_earned`     n
##   <lgl>                            <int>
## 1 FALSE                             7119
## 2 TRUE                                 4

There are 5 students with attempted hours smaller than hours earned. This may be due to AP credits ( credits earned in high school) and will be addressed later.

Histogram of hours_attempted by year

p <- ggplot(df_Degrees, aes(x = hours_attempted))+ geom_histogram(aes(fill = term_year))
p1<- p + facet_wrap(~full_part)

p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Boxplots of hours_attempted by year

p11 = ggplot(df_Degrees, aes(hours_attempted))
p11 + geom_boxplot(aes(colour = term_year)) +
       facet_wrap(~full_part)

Density plot of hours_attempted by year

ggplot(df_Degrees, aes(hours_attempted, fill = term_year)) + geom_density(alpha = 0.2) +
  facet_wrap(~full_part)+
  xlab("Hours attempted") +
  ylab( "Density")

Fivenum Summary

df_Degrees%>% group_by(term_year,full_part)%>%
  summarise(n = n(),
            min = fivenum(hours_attempted)[1],
            Q1 = fivenum(hours_attempted)[2],
            median = fivenum(hours_attempted)[3],
            Q3 = fivenum(hours_attempted)[4],
            max = fivenum(hours_attempted)[5],
            mean= mean(hours_attempted),
            sd = sd(hours_attempted))
## `summarise()` regrouping output by 'term_year' (override with `.groups` argument)
## # A tibble: 4 x 10
## # Groups:   term_year [2]
##   term_year full_part     n   min    Q1 median    Q3   max  mean    sd
##   <chr>     <chr>     <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020      FT         2222     3    12     13    15    53 14.9   5.73
## 2 2020      PT         1527     1     6      8    10    51  8.20  4.82
## 3 2021      FT         2088     3    12     14    16    54 15.4   5.23
## 4 2021      PT         1286     1     6      8    10    47  8.50  4.97

Hours Attempted by Full-time students: The distributions for the hours attempted by full time students in term year 2020 and term year 2021 have similar shapes. The first quartile for both distributions is 12. The median for term year 2021 (14 hours attempted) is slightly higher than that in 2020 (13 hours attempted). The 3rd quartile for 2020 is 15, whilst in 2021 it is 16. Both distibutions are also skewed right and have similar tail lengths. The maxium for 2020 is 53 and in 2021 it is 54. The distribution for 2021 has lower peaks because there were less students enrolled in 2021 than in 2020. Peaks can be attributed to the fact that most courses are 3 or 4 credit courses. The college offers very few 2 credit courses.

Hours Attempted by Part-time students: The distributions for the hours attempted by part time students in term year 2020 and term year 2021 have similar shapes. The first quartile in 2021 is 5 hours and is slightly lower than 2020 ( 6hours). Both distibutions are skewed right. The right tail for 2021 is slightly shorter. The maxium for 2020 is 51 and in 2021 it is 47. The distribution for 2021 has lower peaks because there were less students enrolled in 2021 than in 2020. Peaks can be attributed to the fact that most courses are 3 or 4 credit courses. The college offers very few 2 credit courses.

Students who attempted less than 3 hours

df_Degrees%>% group_by(term_year,full_part)%>%
                      count(hours_attempted<3)
## # A tibble: 6 x 4
## # Groups:   term_year, full_part [4]
##   term_year full_part `hours_attempted < 3`     n
##   <chr>     <chr>     <lgl>                 <int>
## 1 2020      FT        FALSE                  2222
## 2 2020      PT        FALSE                  1488
## 3 2020      PT        TRUE                     39
## 4 2021      FT        FALSE                  2088
## 5 2021      PT        FALSE                  1270
## 6 2021      PT        TRUE                     16

Students who attempted more than 18hours

df_Degrees%>% group_by(term_year,full_part)%>%
                      count(hours_attempted>18)
## # A tibble: 8 x 4
## # Groups:   term_year, full_part [4]
##   term_year full_part `hours_attempted > 18`     n
##   <chr>     <chr>     <lgl>                  <int>
## 1 2020      FT        FALSE                   1898
## 2 2020      FT        TRUE                     324
## 3 2020      PT        FALSE                   1473
## 4 2020      PT        TRUE                      54
## 5 2021      FT        FALSE                   1798
## 6 2021      FT        TRUE                     290
## 7 2021      PT        FALSE                   1241
## 8 2021      PT        TRUE                      45

41 part time students attempted less than 3 hours in term year 2020 and 16 part time students attempted less than 3 hours in term year 2021. Students also need special permission to enroll in more than 18 credits. In term year 2020, 327 full time and 56 part time students attempted more than 18 hours. In term year 2021, 290 full time and 47 part time degree seeking students attempted more than 18 hours. As only a student who is registered for 12 or more credit hours is considered a full-time student. Students classified as part time and enrolled in more than 12hours, are enrolled in less than 12 credit hours in the Fall semester. As non credit hours and credit hours are included in the hours attempted variable, I will use the hours earned variable to filter the datasets and remove outliers.

3.4 Hours Earned by Degree Seeking Students

Histogram of hours_earned by year

p <- ggplot(df_Degrees, aes(x = hours_earned))+ geom_histogram(aes(fill = term_year))
p1<- p + facet_wrap(~full_part)

p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Boxplots of hours_earned by year

p11 = ggplot(df_Degrees, aes(hours_earned))
p11 + geom_boxplot(aes(colour = term_year)) +
       facet_wrap(~full_part)

Density plot of hours_earned by year

ggplot(df_Degrees, aes(hours_earned, fill = term_year)) + geom_density(alpha = 0.2) +
  facet_wrap(~full_part)+
  xlab("Hours Earned") +
  ylab( "Density")

Hours Earned Fivenum Summary

df_Degrees%>% group_by(term_year,full_part)%>%
  summarise(n = n(),
            min = fivenum(hours_earned)[1],
            Q1 = fivenum(hours_earned)[2],
            median = fivenum(hours_earned)[3],
            Q3 = fivenum(hours_earned)[4],
            max = fivenum(hours_earned)[5],
            mean= mean(hours_earned),
            sd = sd(hours_earned))
## `summarise()` regrouping output by 'term_year' (override with `.groups` argument)
## # A tibble: 4 x 10
## # Groups:   term_year [2]
##   term_year full_part     n   min    Q1 median    Q3   max  mean    sd
##   <chr>     <chr>     <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020      FT         2222     0     6      9    13    53 10.4   7.59
## 2 2020      PT         1527     0     0      3     6    51  3.77  5.33
## 3 2021      FT         2088     0     6     10    13    54 10.3   7.41
## 4 2021      PT         1286     0     0      3     6    47  4.37  5.56

QQplot of hours earned by Fulltime Students.

d1<- df_Degrees%>%filter(full_part=="FT")

p<-qplot(sample = hours_earned, data = d1, color=term_year)
p

Hours Earned Summary The distributions appeared to be skewed to the right. Incoming students are allowed to get AP credits for courses taken at highschool. The distributions for part-time student in 2020 and 2021 appear to follow a similar distribution. Peaks can be attributed to the fact that most courses are 3 or 4 credit courses. The college offers very few 2 credit courses. Students also need special permission to enroll in more than 18credits. Fulltime Students: The first quartile in term year 2020 and term year 2021 is 6 credits. The median for term year 2021 ( 10 hours earned) is slightly higher than the median in 2020 ( 9 hours earned). Both distributions has long tails on the right, with 2020 havinga maximum of 53 and 2021 having a maxium of 54. Part time Students: Both 2020 and 2021 follow a similar distribution. The first quartile for both years is 0. This means that 25% of part time students earn 0 credits. The median for both distributions is 3, ie. 50% of students earn up to 3 credits. The 3rd quartile for both distributions is 6. Both distributions have long tails.

Students who earned less than 3 hours

df_Degrees%>% group_by(term_year,full_part)%>%
                      count(hours_earned<3)
## # A tibble: 8 x 4
## # Groups:   term_year, full_part [4]
##   term_year full_part `hours_earned < 3`     n
##   <chr>     <chr>     <lgl>              <int>
## 1 2020      FT        FALSE               1986
## 2 2020      FT        TRUE                 236
## 3 2020      PT        FALSE                830
## 4 2020      PT        TRUE                 697
## 5 2021      FT        FALSE               1813
## 6 2021      FT        TRUE                 275
## 7 2021      PT        FALSE                821
## 8 2021      PT        TRUE                 465

Hours earned equate to credits earned and is the indication of the progress a student is making towards a degree. In term year 2020, 240 full time and 727 part time students earned less than 3 credits. In term year 2021, 290 full time and 504 part time earned less than 3 credits.

Students who earned more than 18 credits

df_Degrees%>% group_by(term_year,full_part)%>%
                      count(hours_earned>18)
## # A tibble: 8 x 4
## # Groups:   term_year, full_part [4]
##   term_year full_part `hours_earned > 18`     n
##   <chr>     <chr>     <lgl>               <int>
## 1 2020      FT        FALSE                1986
## 2 2020      FT        TRUE                  236
## 3 2020      PT        FALSE                1490
## 4 2020      PT        TRUE                   37
## 5 2021      FT        FALSE                1896
## 6 2021      FT        TRUE                  192
## 7 2021      PT        FALSE                1248
## 8 2021      PT        TRUE                   38

3.5 Create New Variable hours_earned_rate

# create variable pass_earned_hours
# hours_earned_rate = hours_earned/hours_attempted

df_Degrees<-df_Degrees %>%
           mutate(hours_earned_rate=hours_earned/hours_attempted)

Density plot of Hours Earned Rate by year

ggplot(df_Degrees, aes(hours_earned_rate, fill = term_year)) + geom_density(alpha = 0.3) +
  facet_wrap(~full_part)+
  xlab("Hours Earned Rate") +
  ylab( "Density")+
  xlim(0,1)
## Warning: Removed 4 rows containing non-finite values (stat_density).

3.6 Create New Variable unearned_hours

Unearned hours are attempted hours that the student did not receive credit for.

df_Degrees<-df_Degrees %>%
           mutate(unearned_hours = hours_attempted-hours_earned)

Density plot of unearned hours per a year

ggplot(df_Degrees, aes(unearned_hours, fill = term_year)) + geom_density(alpha = 0.3) +
  facet_wrap(~full_part)+
  xlab("Unearned hours") +
  ylab( "Density") +
  xlim (0,18)
## Warning: Removed 31 rows containing non-finite values (stat_density).

4 Research Questions

4.1 Hours Earned as a Measure of Progress

Earned credits (hours earned) are the only indication whether a student is making progress towards a degree.

Create New dataframe to clean hours earned and hours attempted.

df_FTDegrees <-df_Degrees %>%
               filter(full_part=="FT")

df_FTDegrees%>% group_by(term_year)%>%
                count(term_year)
## # A tibble: 2 x 2
## # Groups:   term_year [2]
##   term_year     n
##   <chr>     <int>
## 1 2020       2222
## 2 2021       2088

There are 2252 full time degree seeking students in term year 2020. There are 2147 full time degree seeking students in term year 2021.

12 credits is the minimum to qualify as a fulltime student. Check dataset for consistency. Remove if less than 12 hours attempted. Create New dataframe to clean hours earned and hours attempted.

df_FTDegrees%>% group_by(term_year)%>%
                count(hours_attempted<12)
## # A tibble: 4 x 3
## # Groups:   term_year [2]
##   term_year `hours_attempted < 12`     n
##   <chr>     <lgl>                  <int>
## 1 2020      FALSE                   1952
## 2 2020      TRUE                     270
## 3 2021      FALSE                   1934
## 4 2021      TRUE                     154

Create a dataframe for these fulltime students who attempted less than 12 credit. Data framed called df_FT_lessC

#remove certificates, NA refers to degree majors which have not been named to protect student privacy.
df_FT_lessC<-subset(df_FTDegrees, hours_attempted<12)

df_FT_lessC%>% group_by(term_year)%>%
                count(hours_attempted<12)
## # A tibble: 2 x 3
## # Groups:   term_year [2]
##   term_year `hours_attempted < 12`     n
##   <chr>     <lgl>                  <int>
## 1 2020      TRUE                     270
## 2 2021      TRUE                     154

There were 276 fulltime students in term year 2020 who attempted less than 12 hours. There were 162 full time students in term year 2021 who attempted less than 12 hours.

create data frame for students who earned less than 3credits or more than 18 credits. The college offers very few 2credit courses. Students need special written permission to register for more than 18credits. Students with more than 18credits could be getting AP credits.

#remove certificates, NA refers to degree majors which have not been named to protect student privacy.
df_FT_outlier1<-subset(df_FTDegrees, hours_earned<3 )

df_FT_outlier1%>% group_by(term_year)%>%
                count(hours_earned<3)
## # A tibble: 2 x 3
## # Groups:   term_year [2]
##   term_year `hours_earned < 3`     n
##   <chr>     <lgl>              <int>
## 1 2020      TRUE                 236
## 2 2021      TRUE                 275

240 students earned less than 3 credits in 2020 and 290 students earned less than 3 credits in 2021.

I will drop these students from the dataframe.

#drop students who earned less than 3 credits.
df_FTDegrees<-df_FTDegrees%>% 
                subset(hours_earned>2)

df_FTDegrees%>% group_by(term_year)%>%
                count(term_year)
## # A tibble: 2 x 2
## # Groups:   term_year [2]
##   term_year     n
##   <chr>     <int>
## 1 2020       1986
## 2 2021       1813

Plot boxplot of distribution

p14 = ggplot(df_FTDegrees, aes(hours_earned))
p14 + geom_boxplot(aes(colour = term_year)) +
       facet_wrap(~full_part)

Hours Earned Fivenum Summary

df_FTDegrees%>% group_by(term_year,full_part)%>%
  summarise(n = n(),
            min = fivenum(hours_earned)[1],
            Q1 = fivenum(hours_earned)[2],
            median = fivenum(hours_earned)[3],
            Q3 = fivenum(hours_earned)[4],
            max = fivenum(hours_earned)[5])
## `summarise()` regrouping output by 'term_year' (override with `.groups` argument)
## # A tibble: 2 x 8
## # Groups:   term_year [2]
##   term_year full_part     n   min    Q1 median    Q3   max
##   <chr>     <chr>     <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 2020      FT         1986     3     7     10    13    53
## 2 2021      FT         1813     3     7     12    14    54

drop students who earned more than 18credits.

#drop students who earned more than 18 credits.
df_FTDegrees<-df_FTDegrees%>% 
                subset(hours_earned<19)

df_FTDegrees%>% group_by(term_year)%>%
                count(term_year)
## # A tibble: 2 x 2
## # Groups:   term_year [2]
##   term_year     n
##   <chr>     <int>
## 1 2020       1750
## 2 2021       1621

Check GPA

Plot boxplot of distribution

p14 = ggplot(df_FTDegrees, aes(hours_earned))
p14 + geom_boxplot(aes(colour = term_year)) +
       facet_wrap(~full_part)

Hours Earned Fivenum Summary

df_FTDegrees%>% group_by(term_year,full_part)%>%
  summarise(n = n(),
            min = fivenum(hours_earned)[1],
            Q1 = fivenum(hours_earned)[2],
            median = fivenum(hours_earned)[3],
            Q3 = fivenum(hours_earned)[4],
            max = fivenum(hours_earned)[5])
## `summarise()` regrouping output by 'term_year' (override with `.groups` argument)
## # A tibble: 2 x 8
## # Groups:   term_year [2]
##   term_year full_part     n   min    Q1 median    Q3   max
##   <chr>     <chr>     <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 2020      FT         1750     3     6      9    12    18
## 2 2021      FT         1621     3     7     10    13    18

Density plot of hours_earned by year

ggplot(df_FTDegrees, aes(hours_earned, fill = term_year)) + geom_density(alpha = 0.2) +
  facet_wrap(~full_part)+
  xlab("Hours Earned") +
  ylab( "Density")

Question: Who are these students who make up this distribution.

EDA Questions: Will use degree students only. A) How has the pandemic impacted student enrollment? 1. Difference in enrollement numbers, demographics (race, gender, highschool) in term year 2021 and term year 2020. 2. Did students in term year 2021 attempt more or less hours than students in term year 2020. 3. Did students in term year 2021 earn more or less hours than students in term year 2020.

  1. Hours earned is a measure of progress towards a degree.
  1. Which subgroup is the best predictor of the population for hours earned. i.e Subgroup most like the population. ( Smallest Effect Size.) Explore -Race -Gender -High School -Pell Grant